langid.py: An Off-the-shelf Language Identification Tool
Authors
Abstract
We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users who require language identification but do not want to invest in preparing in-domain training data.
Related resources
Langid.py for Better Language Modelling
Large corpora are crucial resources for building many statistical language technology systems, and the Web is a readily available source of vast amounts of linguistic data from which to construct such corpora. Nevertheless, little research has considered how to best build corpora from the Web. In this study we consider the importance of language identification in Web corpus construction. Beginni...
Using Off-the-Shelf Formal Methods to Verify Attribute Grammar Properties
Attribute Grammars are the specification language of many tools that automatically generate programming language implementations. We consider the problem of verifying properties of attribute grammar specifications, particularly properties that are not well supported by existing tools. Rather than propose methods for extending existing tool implementation techniques, we propose the use of off-th...
Real Time Processing of Hyperspectral Images
We describe the development of a real-time processing tool for hyperspectral imagery based on off-the-shelf equipment and higher level programming language implementation (C++ and Java). The algorithms we developed are derived from previously introduced spectra matching and feature extraction tools. The first group is based on spectra identification and spectral screening, a method that allows ...
Development of a smart wireless sensing unit using off-the-shelf FPGA hardware and programming products
In this study, Field-Programmable Gate Arrays (FPGAs) are investigated as a practical solution to the challenge of designing an optimal platform for implementing algorithms in a wireless sensing unit for structural health monitoring. Inherent advantages, such as tremendous processing power, coupled with reconfigurable and flexible architecture render FPGAs a prime candidate for the processing c...
Development of a New Vernacular Tool for Diagnosis of Alcohol Dependence in the Emergency
Background: Alcohol dependence (AD) is a major reason for morbidity and visits to emergency medical settings. However, the detection of AD is often difficult or overlooked. This study aimed to develop a brief screening questionnaire in Hindi language for detection of AD in an emergency medical setting. Methods: The authors in consultation devised a set of questions related to AD in the Hindi l...
Journal title:
Volume, issue:
Pages: -
Publication date: 2012